Today we’ll be using SelectorGadget, which is a Chrome extension that makes it easy to discover CSS selectors. (Install the extension directly here.) Please note that SelectorGadget is only available for Chrome. If you prefer using Firefox, then you can try ScrapeMate.
R packages used today: rvest, janitor, tidyverse, lubridate, hrbrthemes. Recall that rvest was automatically installed with the rest of the tidyverse, so you only need to install the small janitor package:
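That is, a one-time installation step along these lines:

```r
## Install janitor if you haven't already
## (rvest comes bundled with the tidyverse, so no separate install needed)
install.packages("janitor")
```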
The next two lectures are about getting data, or “content”, off the web and onto our computers. We’re all used to seeing this content in our browsers (Chrome, Firefox, etc.). So we know that it must exist somewhere. However, it’s important to realise that there are actually two ways that web content gets rendered in a browser:
You can read here for more details (including example scripts), but for our purposes the essential features are as follows:
Over the next week, we’ll use these lecture notes — plus some student presentations — to go over the main differences between the two approaches and cover the implications for any webscraping activity. I want to forewarn you that webscraping typically involves a fair bit of detective work. You will often have to adjust your steps according to the type of data you want, and the steps that worked on one website may not work on another (or even on the same website a few months later). All this is to say that webscraping involves as much art as it does science.
The good news is that both server-side and client-side websites allow for webscraping.1 If you can see it in your browser, you can scrape it.
The previous sentence elides some important ethical and legal considerations. Just because you can scrape it, doesn’t mean you should. It is ultimately your responsibility to determine whether a website maintains legal restrictions on the content that it provides. Similarly, the tools that we’ll be using are very powerful. It’s fairly easy to write up a function or program that can overwhelm a host server or application through the sheer weight of requests. A computer can process commands much, much faster than we can ever type them up manually. We’ll come back to the “be nice” motif in the next lecture.
There’s also a new package called polite, which aims to improve web etiquette. I’ll come back to it again briefly in the Further resources and exercises section at the bottom of this document.
rvest (server-side)

The primary R package that we’ll be using today is Hadley Wickham’s rvest. Let’s load it now.
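Since it comes bundled with the tidyverse, loading it is just:

```r
library(rvest)
```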
rvest is a simple webscraping package inspired by Python’s Beautiful Soup, but with extra tidyverse functionality. It is also designed to work with webpages that are built server-side and thus requires knowledge of the relevant CSS selectors… Which means that now is probably a good time for us to cover what these are.
Time for a student presentation on CSS (i.e. Cascading Style Sheets) and SelectorGadget. Click on the links if you are reading this after the fact. In short, CSS is a language for specifying the appearance of HTML documents (including web pages). It does this by providing web browsers a set of display rules, which are formed by:

1. Properties. CSS properties are the “how” of the display rules: they determine things like fonts, colours, page width, etc.
2. Selectors. CSS selectors are the “what” of the display rules: they identify which elements the properties should be applied to (e.g. paragraph headers, table columns, etc.).
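As a concrete (made-up) illustration, a CSS rule pairs a selector with a set of property–value declarations:

```css
/* General form: selector { property: value; ... } */

/* ".wikitable" is a class selector: any element with class="wikitable"
   (e.g. <table class="wikitable">) gets these display properties. */
table.wikitable {
  background-color: #eaecf0;   /* a property: the table's fill colour   */
  border: 1px solid #a2a9b1;   /* another property: the table's border  */
}
```

The selector names and colour values here are invented for illustration; the point is simply that selectors give us named hooks into specific pieces of a page, which is exactly what we’ll exploit when scraping.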
The key point is that if you can identify the CSS selector(s) of the content you want, then you can isolate it from the rest of the webpage content that you don’t want. This is where SelectorGadget comes in. We’ll work through an extended example (with a twist!) below, but I highly recommend looking over this quick vignette from Hadley before proceeding.
Okay, let’s get to an application. Say that we want to scrape the Wikipedia page on the Men’s 100 metres world record progression.
First, open up this page in your browser. Take a look at its structure: What type of objects does it contain? How many tables does it have? Etc.
Once you’ve familiarised yourself with the structure, read the whole page into R using the rvest::read_html() function.
wp <- "http://en.wikipedia.org/wiki/Men%27s_100_metres_world_record_progression"
m100 <-
wp %>%
read_html()
m100## {xml_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...
As you can see, this is an XML document2 that contains everything needed to render the Wikipedia page. It’s kind of like viewing someone’s entire LaTeX document (preamble, syntax, etc.) when all we want are the data from some tables in their paper.
Let’s try to isolate the first table on the page, which documents the unofficial progression before the IAAF. As per the rvest vignette, we can use rvest::html_nodes() to isolate and extract this table from the rest of the HTML document by providing the relevant CSS selector. We should then be able to convert it into a data frame using rvest::html_table(). I also recommend using the fill=TRUE option here, since we’ll otherwise run into formatting problems due to row spans in the Wiki table.
I’ll use SelectorGadget to identify the CSS selector. In this case, I get “div+ .wikitable :nth-child(1)”, so let’s check if that works.
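Chaining together the rvest verbs described above, that check would look something like this:

```r
## Attempt to extract the first table using the SelectorGadget-suggested selector
m100 %>%
  html_nodes("div+ .wikitable :nth-child(1)") %>%
  html_table(fill = TRUE)
```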
## Error in html_table.xml_node(X[[i]], ...): html_name(x) == "table" is not TRUE
Uh-oh! It seems that we immediately run into an error. I won’t go into details here, but we have to be cautious with SelectorGadget sometimes. It’s a great tool and usually works perfectly. However, occasionally what looks like the right selection (i.e. the highlighted stuff in yellow) is not exactly what we’re looking for. I deliberately chose this Wikipedia 100m example because I wanted to showcase this potential pitfall. Again: Webscraping is as much art as it is science.
Fortunately, there’s a more precise way of determining the right selectors using the “inspect web element” feature that is available in all modern browsers. In this case, I’m going to use Google Chrome (either right click and then “Inspect”, or Ctrl+Shift+I). I proceed by scrolling over the source elements until Chrome highlights the table of interest. Then right click and Copy -> Copy selector. Here’s a GIF animation of these steps:
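Assuming the copied selector comes back as something like "#mw-content-text > div > table:nth-child(8)" (a hypothetical value; yours may differ depending on the current page layout), the extraction would then be:

```r
## Extract the pre-IAAF table using the selector copied from Chrome's inspector
## NB: "#mw-content-text > div > table:nth-child(8)" is a hypothetical selector
pre_iaaf <-
  m100 %>%
  html_node("#mw-content-text > div > table:nth-child(8)") %>%
  html_table(fill = TRUE)
```

Note the use of html_node() (singular) here, since the copied selector pins down exactly one element.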